Abstract

Networks arising from social, technological, and natural domains
exhibit rich connectivity patterns, and nodes in such networks are
often labeled with attributes or features. We address the question of
modeling the structure of networks where nodes have attribute
information. We present a Multiplicative Attribute Graph (MAG) model
that considers nodes with categorical attributes and models the
probability of an edge as the product of individual attribute link
formation affinities.

We develop a scalable variational expectation maximization parameter
estimation method. Experiments show that the MAG model reliably
captures network connectivity and provides insights into how different
attributes shape the network structure.

1  Introduction 

Social and biological systems can be modeled as interaction networks
where nodes and edges represent entities and interactions. Viewing
real systems as networks has led to the discovery of underlying
organizational principles as well as to high-impact applications [14].
As organizational principles of networks are discovered, the following
questions arise: Why are networks organized the way they are? How can
we model this?

Network modeling has a rich history and can be roughly divided into
two streams. The first comprises explanatory “mechanistic” models that
posit simple generative mechanisms leading to networks with realistic
connectivity patterns. For example, the Copying model [7] states a
simple rule where a new node joins the network, randomly picks an
existing node, and links to some of its neighbors. One can prove that
under this generative mechanism networks with power-law degree
distributions naturally emerge. The second line of work comprises
statistical models of network structure, which are usually accompanied
by model parameter estimation procedures and have proven useful for
hypothesis testing. However, such models are often analytically
intractable, as they do not lend themselves to mathematical analysis
of the structural properties of the networks that emerge from them.

Recently a new line of work has emerged. It develops network models
that are analytically tractable, in the sense that one can
mathematically analyze structural properties of networks that emerge
from the models, as well as statistically meaningful, in the sense
that there exist efficient parameter estimation techniques. For
instance, one can prove mathematically that the Kronecker graphs model
[10] gives rise to networks with a small diameter, a giant connected
component, and so on. It can also be fitted to real networks to
reliably mimic their structure.

However, the above models focus only on modeling the network structure
and do not consider information about properties of the nodes of the
network. Often nodes have features or attributes associated with them,
and the question is how to characterize and model the interactions
between the node properties and the network structure. For instance,
users in an online social network have profile information like age
and gender, and we are interested in modeling how these attributes
interact to give rise to the observed network structure.

We present the Multiplicative Attribute Graphs (MAG) model that
naturally captures interactions between the node attributes and the
observed network structure. The model considers nodes with categorical
attributes, and the probability of an edge between a pair of nodes
depends on the individual attribute link formation affinities. The MAG
model is analytically tractable in the sense that we can prove that
networks arising from the model exhibit connectivity patterns that are
also found in real-world networks [5]. For example, networks arising
from the model have heavy-tailed degree distributions, small diameter,
and a unique giant connected component [5]. Moreover, the MAG model
captures homophily (i.e., tendency to link to similar others) as well
as heterophily (i.e., tendency to link to different others) of
different node attributes.

In this paper we develop MagFit, a scalable parameter estimation
method for the MAG model. We start by defining the generative
interpretation of the model and then cast the model parameter
estimation as a maximum likelihood problem. Our approach is based on
the variational expectation maximization framework and scales to large
networks. Experiments on several real-world networks demonstrate that
the MAG model reliably captures the network connectivity patterns and
outperforms present state-of-the-art methods. Moreover, the model
parameters have a natural interpretation and provide additional
insights into how node attributes shape the structure of networks.

2  Multiplicative  Attribute  Graphs 

The Multiplicative Attribute Graphs (MAG) model [5] is a class of
generative models for networks with node attributes. MAG combines
categorical node attributes with their affinities to compute the
probability of a link. For example, some node attributes (e.g.,
political affiliation) may have positive affinities, in the sense that
sharing the same political view increases the probability of being
linked (i.e., homophily), while other attributes may have negative
affinities, i.e., people are more likely to link to others with a
different value of that attribute.

Formally, we consider a directed graph $A$ (represented by its binary
adjacency matrix) on $N$ nodes. Each node $i$ has $L$ categorical
attributes, $F_{i1}, \ldots, F_{iL}$, and each attribute $l$
($l = 1, \ldots, L$) is associated with an affinity matrix $\Theta_l$
which quantifies the affinity of the attribute to form a link. Each
entry $\Theta_l[k, k'] \in (0, 1)$ of the affinity matrix indicates
the potential for a pair of nodes to form a link, given the $l$-th
attribute value $k$ of the first node and value $k'$ of the second
node. For a given pair of nodes, their attribute values “select”
proper entries of affinity matrices, i.e., the first node’s attribute
value selects a “row” while the second node’s attribute value selects
a “column”. The link probability is then defined as the product of the
selected entries of affinity matrices. Each edge $(i, j)$ is then
included in the graph $A$ independently with probability $p_{ij}$:

$$p_{ij} := P(A_{ij} = 1) = \prod_{l=1}^{L} \Theta_l[F_{il}, F_{jl}]. \tag{1}$$

Figure 1: Multiplicative Attribute Graph (MAG) model. Each node $i$
has categorical attribute vector $F_i$. The probability $p_{ij}$ of
edge $(i, j)$ is then determined by attributes “selecting” the
appropriate entries of attribute affinity matrices $\Theta_l$.

Figure 1 illustrates the model. Nodes $i$ and $j$ have the binary
attribute vectors $[0, 0, 1, 0]$ and $[0, 1, 1, 0]$, respectively. We
then select the entries of the attribute affinity matrices,
$\Theta_1[0, 0]$, $\Theta_2[0, 1]$, $\Theta_3[1, 1]$, and
$\Theta_4[0, 0]$, and compute the link probability $p_{ij}$ of link
$(i, j)$ as a product of these selected entries.
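The row and column selection in Eq. (1) can be illustrated with a
minimal Python sketch. The attribute vectors are the ones from the
Figure 1 example; the affinity values and the function name
`edge_probability` are hypothetical and chosen only for illustration.

```python
def edge_probability(f_i, f_j, thetas):
    """Eq. (1): product over attributes l of Theta_l[F_il, F_jl]."""
    p = 1.0
    for theta, a, b in zip(thetas, f_i, f_j):
        # Node i's attribute value picks the row, node j's picks the column.
        p *= theta[a][b]
    return p


# Hypothetical 2x2 affinity matrices, one per attribute (L = 4).
thetas = [
    [[0.9, 0.2], [0.2, 0.6]],
    [[0.8, 0.3], [0.3, 0.7]],
    [[0.5, 0.1], [0.1, 0.9]],
    [[0.7, 0.4], [0.4, 0.5]],
]
f_i = [0, 0, 1, 0]  # attribute vector of node i (from the example)
f_j = [0, 1, 1, 0]  # attribute vector of node j (from the example)

# Selected entries: Theta_1[0,0], Theta_2[0,1], Theta_3[1,1], Theta_4[0,0],
# i.e. 0.9 * 0.3 * 0.9 * 0.7, which is about 0.1701.
p_ij = edge_probability(f_i, f_j, thetas)
```

Because the graph is directed, swapping the two arguments selects the
transposed entries, so $p_{ij}$ and $p_{ji}$ differ whenever some
$\Theta_l$ is not symmetric.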

Kim & Leskovec [5] proved that the MAG model captures connectivity
patterns observed in real-world networks, such as heavy-tailed
(power-law or log-normal) degree distributions, small diameters, a
unique giant connected component, and local clustering of the edges.
They provided both analytical and empirical evidence demonstrating
that the MAG model effectively captures the structure of real-world
networks.

The MAG model can handle attributes of any cardinality; however, for
simplicity we limit our discussion to binary attributes. Thus, every
$F_{il}$ takes a value of either 0 or 1, and every $\Theta_l$ is a
$2 \times 2$ matrix.

Model parameter estimation. So far we have seen how, given the node
attributes $F$ and the corresponding attribute affinity matrices
$\Theta$, we generate a MAG network. Now we focus on the reverse
problem: given a network $A$ and the number of attributes $L$, we aim
to estimate the affinity matrices $\Theta$ and the node attributes $F$.

In other words, we aim to represent the given real network $A$ in the
form of the MAG model parameters: node attributes
$F = \{F_{il};\ i = 1, \ldots, N,\ l = 1, \ldots, L\}$ and attribute
affinity matrices $\Theta = \{\Theta_l;\ l = 1, \ldots, L\}$. Since
MAG yields a probabilistic adjacency matrix that independently assigns
a link probability to every pair of nodes, the likelihood
$P(A \mid F, \Theta)$ of a given graph (adjacency matrix) $A$ is the
product of the edge probabilities over the edges and non-edges of the
network:

$$P(A \mid F, \Theta) = \prod_{A_{ij} = 1} p_{ij} \prod_{A_{ij} = 0} (1 - p_{ij}), \tag{2}$$

where $p_{ij}$ is defined in Eq. (1).
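As a small illustration of Eq. (2), the following Python sketch
evaluates the log of the likelihood for given attributes and affinity
matrices. It assumes that self-loops are excluded (only ordered pairs
with $i \neq j$ are scored); the helper names and the toy values are
hypothetical.

```python
import math


def edge_probability(f_i, f_j, thetas):
    """Eq. (1): product of the selected affinity entries."""
    p = 1.0
    for theta, a, b in zip(thetas, f_i, f_j):
        p *= theta[a][b]
    return p


def log_likelihood(adj, attrs, thetas):
    """Log of Eq. (2), summed over ordered pairs i != j (assumption)."""
    n = len(adj)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: self-loops are not modeled
            p = edge_probability(attrs[i], attrs[j], thetas)
            total += math.log(p) if adj[i][j] else math.log(1.0 - p)
    return total


# Toy instance: two nodes, one binary attribute, one edge 0 -> 1.
toy_thetas = [[[0.5, 0.2], [0.2, 0.8]]]
toy_attrs = [[0], [1]]
toy_adj = [[0, 1], [0, 0]]
toy_ll = log_likelihood(toy_adj, toy_attrs, toy_thetas)
```

Here both ordered pairs select the entry 0.2, so the value is
$\log 0.2 + \log 0.8$: one term for the observed edge and one for the
missing reverse edge.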

Now we can use maximum likelihood estimation to find the node
attributes $F$ and their affinity matrices $\Theta$. Hence, ideally we
would like to solve

$$\arg\max_{F, \Theta} P(A \mid F, \Theta). \tag{3}$$

However, there are several challenges with this problem formulation.
First, notice that Eq. (3) is a combinatorial problem over $O(LN)$
categorical variables even when the affinity matrices $\Theta$ are
fixed. Finding both $F$ and $\Theta$ simultaneously is even harder.
Second, even if we could solve this combinatorial problem, the model
has a lot of parameters, which may cause high variance.

Figure 2: MAG model: Node attributes $F_{il}$ are sampled from $\mu_l$
and combined with affinity matrices $\Theta_l$ to generate a
probabilistic adjacency matrix $P$.

To resolve these challenges, we consider a simple generative model for
the node attributes. We assume that the $l$-th attribute of each node
is drawn from an i.i.d. Bernoulli distribution parameterized by
$\mu_l$. This means that the $l$-th attribute of every node takes
value 1 with probability $\mu_l$, i.e.,
$F_{il} \sim \mathrm{Bernoulli}(\mu_l)$.

Figure 2 illustrates the model in plate notation. First, node
attributes $F_{il}$ are generated by the corresponding Bernoulli
distributions $\mu_l$. By combining these node attributes with the
affinity matrices $\Theta_l$, the probabilistic adjacency matrix $P$
is formed. Network $A$ is then generated by a series of coin flips
where each edge $A_{ij}$ appears with probability $P_{ij}$.
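The generative process just described (Bernoulli attributes, then
independent coin flips for edges) can be sketched in a few lines of
Python. This is a naive $O(LN^2)$ illustration, not the authors'
implementation; the function name `sample_mag`, the exclusion of
self-loops, and all parameter values are assumptions.

```python
import random


def sample_mag(n, mus, thetas, rng):
    """Sample node attributes and a directed graph from the simplified model."""
    # Step 1: F_il ~ Bernoulli(mu_l), independently for every node and attribute.
    attrs = [[1 if rng.random() < mu else 0 for mu in mus] for _ in range(n)]
    # Step 2: one coin flip per ordered pair, with P_ij given by Eq. (1).
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: no self-loops
            p = 1.0
            for theta, a, b in zip(thetas, attrs[i], attrs[j]):
                p *= theta[a][b]
            adj[i][j] = 1 if rng.random() < p else 0
    return attrs, adj


# Hypothetical parameters: L = 2 attributes, 5L = 10 numbers in total.
mus = [0.5, 0.3]
thetas = [
    [[0.9, 0.4], [0.4, 0.7]],
    [[0.8, 0.2], [0.2, 0.6]],
]
attrs, adj = sample_mag(50, mus, thetas, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sketch
reproducible without touching the global random state.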

Even this simplified model provably generates networks with power-law
degree distributions, small diameter, and a unique giant component
[5]. The simplified model requires only $5L$ parameters (4 per each
$\Theta_l$, 1 per $\mu_l$). Note that the number of attributes $L$ can
be thought of as constant or slowly increasing in the number of nodes
$N$ (e.g., $L = O(\log N)$) [2, 5].

The generative model for node attributes slightly modifies the
objective function in Eq. (3). We maintain the maximum likelihood
approach, but instead of directly finding attributes $F$ we now
estimate parameters $\mu_l$ that then generate latent node attributes
$F$.

We denote the log-likelihood $\log P(A \mid \mu, \Theta)$ as
$\mathcal{L}(\mu, \Theta)$ and aim to find $\mu = \{\mu_l\}$ and
$\Theta = \{\Theta_l\}$ by maximizing

$$\mathcal{L}(\mu, \Theta) = \log P(A \mid \mu, \Theta) = \log \sum_{F} P(A, F \mid \mu, \Theta).$$

Note that since $\mu$ and $\Theta$ are linked through $F$ we have to
sum over all possible instantiations of node attributes $F$. Since $F$
consists of $L \cdot N$ binary variables, the number of all possible
instantiations of $F$ is $O(2^{LN})$, which makes computing
$\mathcal{L}(\mu, \Theta)$ directly intractable. In the next section
we will show how to quickly (but approximately) compute the summation.

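To compute the likelihood $P(A, F \mid \mu, \Theta)$, we have to
consider the likelihood of node attributes. Note that each edge
$A_{ij}$ is independent given the attributes $F$ and each attribute
$F_{il}$ is independent given the parameters $\mu_l$. By this
conditional independence and the fact that both $A_{ij}$ and $F_{il}$
follow Bernoulli distributions with parameters $p_{ij}$ and $\mu_l$,
we obtain

$$\begin{aligned}
P(A, F \mid \mu, \Theta) &= P(A \mid F, \mu, \Theta)\, P(F \mid \mu, \Theta) \\
&= P(A \mid F, \Theta)\, P(F \mid \mu) \\
&= \prod_{A_{ij} = 1} p_{ij} \prod_{A_{ij} = 0} (1 - p_{ij}) \prod_{F_{il} = 1} \mu_l \prod_{F_{il} = 0} (1 - \mu_l),
\end{aligned} \tag{4}$$

where $p_{ij}$ is defined in Eq. (1).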

3  MAG  Parameter  Estimation 

Now, given a network $A$, we aim to estimate the parameters $\mu_l$ of
the node attribute model as well as the attribute affinity matrices
$\Theta_l$. We regard the actual node attribute values $F$ as latent
variables and use the expectation maximization framework.

We present an approximate method to solve the problem by developing a
variational Expectation-Maximization (EM) algorithm. We first derive
the lower bound $\mathcal{L}_Q(\mu, \Theta)$ on the true
log-likelihood $\mathcal{L}(\mu, \Theta)$ by introducing the
variational distribution $Q(F)$ parameterized by variational
parameters $\phi$. Then, we indirectly maximize
$\mathcal{L}(\mu, \Theta)$ by maximizing its lower bound
$\mathcal{L}_Q(\mu, \Theta)$. In the E-step, we estimate $Q(F)$ by
maximizing $\mathcal{L}_Q(\mu, \Theta)$ over the variational
parameters $\phi$. In the M-step, we maximize the lower bound
$\mathcal{L}_Q(\mu, \Theta)$ over the MAG model parameters ($\mu$ and
$\Theta$) to approximately maximize the actual log-likelihood
$\mathcal{L}(\mu, \Theta)$. We alternate between E- and M-steps until
the parameters converge.

Variational EM. Next we introduce the distribution $Q(F)$
parameterized by variational parameters $\phi$. The idea is to define
an easy-to-compute $Q(F)$ that allows us to compute the lower bound
$\mathcal{L}_Q(\mu, \Theta)$ of the true log-likelihood
$\mathcal{L}(\mu, \Theta)$. Then instead of maximizing the
hard-to-compute $\mathcal{L}$, we maximize $\mathcal{L}_Q$.

We now show that in order to make the gap between the lower bound
$\mathcal{L}_Q$ and the original log-likelihood $\mathcal{L}$ small we
should find an easy-to-compute $Q(F)$ that closely approximates
$P(F \mid A, \mu, \Theta)$. For now we keep $Q(F)$ abstract and
precisely define it later.

We begin by computing the lower bound $\mathcal{L}_Q$ in terms of
$Q(F)$. We plug $Q(F)$ into $\mathcal{L}(\mu, \Theta)$ as follows:


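$$\begin{aligned}
\mathcal{L}(\mu, \Theta) &= \log \sum_{F} P(A, F \mid \mu, \Theta) \\
&= \log \sum_{F} Q(F)\, \frac{P(A, F \mid \mu, \Theta)}{Q(F)} \\
&= \log \mathbb{E}_Q\left[ \frac{P(A, F \mid \mu, \Theta)}{Q(F)} \right].
\end{aligned} \tag{5}$$

As $\log x$ is a concave function, by Jensen’s inequality,

$$\log \mathbb{E}_Q\left[ \frac{P(A, F \mid \mu, \Theta)}{Q(F)} \right] \ge \mathbb{E}_Q\left[ \log \frac{P(A, F \mid \mu, \Theta)}{Q(F)} \right].$$

Therefore, by taking

$$\mathcal{L}_Q(\mu, \Theta) = \mathbb{E}_Q\left[ \log P(A, F \mid \mu, \Theta) - \log Q(F) \right], \tag{6}$$

$\mathcal{L}_Q(\mu, \Theta)$ becomes the lower bound on
$\mathcal{L}(\mu, \Theta)$.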

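Now the question is how to set $Q(F)$ so that we make the gap between
$\mathcal{L}_Q$ and $\mathcal{L}$ as small as possible. The lower
bound $\mathcal{L}_Q$ is tight when the proposal distribution $Q(F)$
becomes close to the true posterior distribution
$P(F \mid A, \mu, \Theta)$ in the KL divergence. More precisely, since
$P(A \mid \mu, \Theta)$ is independent of $F$,
$\mathcal{L}(\mu, \Theta) = \log P(A \mid \mu, \Theta) = \mathbb{E}_Q[\log P(A \mid \mu, \Theta)]$.
Thus, the gap between $\mathcal{L}$ and $\mathcal{L}_Q$ is

$$\begin{aligned}
\mathcal{L}(\mu, \Theta) - \mathcal{L}_Q(\mu, \Theta)
&= \log P(A \mid \mu, \Theta) - \mathbb{E}_Q\left[ \log P(A, F \mid \mu, \Theta) - \log Q(F) \right] \\
&= \mathbb{E}_Q\left[ \log P(A \mid \mu, \Theta) - \log P(A, F \mid \mu, \Theta) + \log Q(F) \right] \\
&= \mathbb{E}_Q\left[ \log Q(F) - \log P(F \mid A, \mu, \Theta) \right],
\end{aligned}$$

which means that the gap between $\mathcal{L}$ and $\mathcal{L}_Q$ is
exactly the KL divergence between the proposal distribution $Q(F)$ and
the true posterior distribution $P(F \mid A, \mu, \Theta)$.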
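The identity "log-likelihood = lower bound + KL divergence" can be
checked numerically on a graph small enough to enumerate every
attribute assignment. The sketch below uses $N = 2$ nodes and $L = 1$
attribute, so $F$ has only four instantiations; all parameter values
and the fully factorized choice of $Q$ are hypothetical and serve only
as an illustration.

```python
import itertools
import math

mu = 0.4                            # hypothetical Bernoulli parameter
theta = [[0.6, 0.3], [0.2, 0.7]]    # hypothetical 2x2 affinity matrix
adj = [[0, 1], [0, 0]]              # observed graph: single edge 0 -> 1
phi = [0.5, 0.8]                    # Q(F_i = 1) for node 0 and node 1


def joint(f):
    """P(A, F | mu, Theta) as in Eq. (4), for N = 2 and L = 1."""
    p = 1.0
    for f_i in f:
        p *= mu if f_i == 1 else 1.0 - mu
    for i, j in [(0, 1), (1, 0)]:
        p_ij = theta[f[i]][f[j]]
        p *= p_ij if adj[i][j] else 1.0 - p_ij
    return p


def q(f):
    """Fully factorized proposal Q(F) with parameters phi."""
    prob = 1.0
    for f_i, phi_i in zip(f, phi):
        prob *= phi_i if f_i == 1 else 1.0 - phi_i
    return prob


states = list(itertools.product([0, 1], repeat=2))
evidence = sum(joint(f) for f in states)          # P(A | mu, Theta)
log_lik = math.log(evidence)                      # exact log-likelihood
lower_bound = sum(q(f) * (math.log(joint(f)) - math.log(q(f))) for f in states)
kl = sum(q(f) * (math.log(q(f)) - math.log(joint(f) / evidence)) for f in states)
gap = log_lik - lower_bound                       # equals kl up to rounding
```

The exhaustive sum over `states` is exactly the computation that
becomes infeasible for realistic $N$ and $L$, which is why the bound
is optimized instead.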

Now we know how to choose $Q(F)$ to make the gap small. We want a
$Q(F)$ that is easy to compute and at the same time closely
approximates $P(F \mid A, \mu, \Theta)$. We propose the following
$Q(F)$ parameterized by $\phi$:

$$\begin{aligned}
F_{il} &\sim \mathrm{Bernoulli}(\phi_{il}), \\
Q_{il}(F_{il}) &= \phi_{il}^{F_{il}} (1 - \phi_{il})^{1 - F_{il}}, \\
Q(F) &= \prod_{i,l} Q_{il}(F_{il}),
\end{aligned} \tag{7}$$

where $\phi = \{\phi_{il}\}$ are variational parameters and
$F = \{F_{il}\}$. Our $Q(F)$ has several advantages. First, the
computation of $\mathcal{L}_Q$ for fixed model parameters $\mu$ and
$\Theta$ is tractable because
$\log P(A, F \mid \mu, \Theta) - \log Q(F)$ in Eq. (6) is separable in
terms of $F_{il}$. This means that we are able to update each
$\phi_{il}$ in turn to maximize $\mathcal{L}_Q$ by fixing all the
other parameters: $\mu$, $\Theta$, and all $\phi$ except the given
$\phi_{il}$. Furthermore, since each $\phi_{il}$ represents the
approximate posterior distribution of $F_{il}$ given the network, we
can estimate each attribute $F_{il}$ by $\phi_{il}$.


Regularization by mutual information. In order to improve the
robustness of the MAG parameter estimation procedure, we enforce that
each attribute is independent of the others. Maximum likelihood
estimation cannot guarantee independence between the node attributes,
and so the solution might converge to a local optimum where the
attributes are correlated. To prevent this, we add a penalty term that
aims to minimize the mutual information (i.e., maximize the entropy)
between pairs of attributes.

Since the distribution for each attribute $F_{il}$ is defined by
$\phi_{il}$, we define the mutual information between a pair of
attributes in terms of $\phi$. We denote this mutual information as
$\mathrm{MI}(F) = \sum_{l \neq l'} \mathrm{MI}_{ll'}$, where
$\mathrm{MI}_{ll'}$ represents the mutual information between the
attributes $l$ and $l'$. We then regularize the log-likelihood with
the mutual information term. We arrive at the following MagFit
optimization problem that we actually solve:

$$\arg\max_{\phi, \mu, \Theta} \ \mathcal{L}_Q(\mu, \Theta) - \lambda \sum_{l \neq l'} \mathrm{MI}_{ll'}. \tag{8}$$


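We can quickly compute the mutual information $\mathrm{MI}_{ll'}$
between attributes $l$ and $l'$. Let $F_{\{\cdot l\}}$ denote a random
variable representing the value of attribute $l$. Then, the
probability $P(F_{\{\cdot l\}} = x)$ that attribute $l$ takes value
$x$ is computed by averaging $Q_{il}(x)$ over $i$. Similarly, the
joint probability $P(F_{\{\cdot l\}} = x, F_{\{\cdot l'\}} = y)$ of
attributes $l$ and $l'$ taking values $x$ and $y$ can be computed
given $Q(F)$. We compute $\mathrm{MI}_{ll'}$ using $Q_{il}$ defined in
Eq. (7) as follows:

$$\begin{aligned}
p_l(x) &:= P(F_{\{\cdot l\}} = x) = \frac{1}{N} \sum_{i} Q_{il}(x), \\
p_{ll'}(x, y) &:= P(F_{\{\cdot l\}} = x, F_{\{\cdot l'\}} = y) = \frac{1}{N} \sum_{i} Q_{il}(x)\, Q_{il'}(y), \\
\mathrm{MI}_{ll'} &= \sum_{x, y \in \{0, 1\}} p_{ll'}(x, y) \log \left( \frac{p_{ll'}(x, y)}{p_l(x)\, p_{l'}(y)} \right).
\end{aligned} \tag{9}$$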
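A direct Python transcription of Eq. (9) may help make the averaging
over nodes concrete. The function name `mutual_information` and the
test matrices are hypothetical; `phi[i][l]` is assumed to hold
$\phi_{il}$, and terms with zero joint probability are skipped using
the usual convention $0 \log 0 = 0$.

```python
import math


def mutual_information(phi, l, lp):
    """MI between attributes l and lp as in Eq. (9); phi[i][l] = Q(F_il = 1)."""
    n = len(phi)

    def q(i, a, x):
        # Q_ia(x): probability that attribute a of node i takes value x.
        return phi[i][a] if x == 1 else 1.0 - phi[i][a]

    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            p_l = sum(q(i, l, x) for i in range(n)) / n
            p_lp = sum(q(i, lp, y) for i in range(n)) / n
            p_joint = sum(q(i, l, x) * q(i, lp, y) for i in range(n)) / n
            if p_joint > 0.0:  # convention: 0 * log 0 = 0
                mi += p_joint * math.log(p_joint / (p_l * p_lp))
    return mi


# Two nodes whose two attributes always agree: MI equals log 2.
phi_correlated = [[1.0, 1.0], [0.0, 0.0]]
# Every node has the same phi row: the joint factorizes, so MI is 0.
phi_independent = [[0.3, 0.6], [0.3, 0.6]]
mi_corr = mutual_information(phi_correlated, 0, 1)
mi_indep = mutual_information(phi_independent, 0, 1)
```

Each call costs $O(N)$, so summing over all attribute pairs costs
$O(L^2 N)$ in this naive form.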


The MagFit algorithm. To solve the regularized MagFit problem in
Eq. (8), we use the EM algorithm, which maximizes the lower bound
$\mathcal{L}_Q(\mu, \Theta)$ regularized by the mutual information. In
the E-step, we reduce the gap between the original log-likelihood
$\mathcal{L}(\mu, \Theta)$ and its lower bound
$\mathcal{L}_Q(\mu, \Theta)$ as well as minimize the mutual
information between pairs of attributes. By fixing the model
parameters $\mu$ and $\Theta$, we update $\phi_{il}$ one by one using
a gradient-based method. In the M-step, we then maximize
$\mathcal{L}_Q(\mu, \Theta)$ by updating the model parameters $\mu$
and $\Theta$. We repeat E- and M-steps until all the parameters
$\phi$, $\mu$, and $\Theta$ converge. Next we briefly overview the E-
and the M-step. We give further details in the Appendix.

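Variational E-Step. In the E-step, we consider model parameters $\mu$
and $\Theta$ as given and we aim to find the values of variational
parameters $\phi$ that maximize $\mathcal{L}_Q(\mu, \Theta)$ as well
as minimize the mutual information $\mathrm{MI}(F)$. We use the
stochastic gradient method to update variational parameters $\phi$. We
randomly select a batch of entries in $\phi$ and update them by their
gradient values of the objective function in Eq. (8). We repeat this
procedure until parameters $\phi$ converge.

First, by computing $\partial \mathcal{L}_Q / \partial \phi_{il}$ and
$\partial \mathrm{MI} / \partial \phi_{il}$ we obtain the gradient
$\nabla_\phi (\mathcal{L}_Q(\mu, \Theta) - \lambda\, \mathrm{MI}(F))$
(see Appendix for details). Then we choose a batch of $\phi_{il}$ at
random and update them by
$\partial \mathcal{L}_Q / \partial \phi_{il} - \lambda\, \partial \mathrm{MI} / \partial \phi_{il}$
in each step. The mutual information regularization term typically
works in the opposite direction of the likelihood. Intuitively, the
regularization prevents the solution from being stuck in a local
optimum where the node attributes are correlated. Algorithm 1 gives
the pseudocode.

Algorithm 1 MagFit-VarEStep($A$, $\mu$, $\Theta$)
  Initialize $\phi^{(0)} = \{\phi_{il} : i = 1, \ldots, N,\ l = 1, \ldots, L\}$
  for $t \leftarrow 0$ to $T - 1$ do
    Select $S \subset \phi^{(t)}$ with $|S| = B$
    for $\phi_{il}^{(t)} \in S$ do
      Compute $\partial \mathcal{L}_Q / \partial \phi_{il}$
      $\partial \mathrm{MI} / \partial \phi_{il} \leftarrow 0$
      for $l' \neq l$ do
        Compute $\partial \mathrm{MI}_{ll'} / \partial \phi_{il}$
        $\partial \mathrm{MI} / \partial \phi_{il} \leftarrow \partial \mathrm{MI} / \partial \phi_{il} + \partial \mathrm{MI}_{ll'} / \partial \phi_{il}$
      end for
      $\phi_{il}^{(t+1)} \leftarrow \phi_{il}^{(t)} + \eta \left( \partial \mathcal{L}_Q / \partial \phi_{il} - \lambda\, \partial \mathrm{MI} / \partial \phi_{il} \right)$
    end for
  end for

Algorithm 2 MagFit-VarMStep($\phi$, $G$, $\Theta^{(0)}$)
  for $l \leftarrow 1$ to $L$ do
    $\mu_l \leftarrow \frac{1}{N} \sum_i \phi_{il}$
  end for
  for $t \leftarrow 0$ to $T - 1$ do
    for $l \leftarrow 1$ to $L$ do
      $\Theta_l^{(t+1)} \leftarrow \Theta_l^{(t)} + \eta\, \nabla_{\Theta_l} \mathcal{L}_\Theta$
    end for
  end for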

Variational M-Step. In the E-step, we introduced the variational
distribution $Q(F)$ parameterized by $\phi$ and approximated the
posterior distribution $P(F \mid A, \mu, \Theta)$ by maximizing
$\mathcal{L}_Q(\mu, \Theta)$ over $\phi$. In the M-step, we now fix
$Q(F)$, i.e., fix the variational parameters $\phi$, and update the
model parameters $\mu$ and $\Theta$ to maximize $\mathcal{L}_Q$.

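First, in order to maximize $\mathcal{L}_Q(\mu, \Theta)$ with respect
to $\mu$, we need to maximize
$\mathcal{L}_{\mu_l} = \sum_i \mathbb{E}_{Q_{il}}[\log P(F_{il} \mid \mu_l)]$
for each $\mu_l$. By the definitions in Eq. (4) and Eq. (7), we obtain

$$\mathcal{L}_{\mu_l} = \sum_i \left( \phi_{il} \log \mu_l + (1 - \phi_{il}) \log (1 - \mu_l) \right).$$

Then $\mathcal{L}_{\mu_l}$ is maximized when

$$\frac{\partial \mathcal{L}_{\mu_l}}{\partial \mu_l} = \sum_i \left( \frac{\phi_{il}}{\mu_l} - \frac{1 - \phi_{il}}{1 - \mu_l} \right) = 0,$$

which holds when $\mu_l = \frac{1}{N} \sum_i \phi_{il}$.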
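The closed-form update for $\mu_l$ is simply the column mean of the
variational parameters. The following Python sketch computes it and
also evaluates the objective $\mathcal{L}_{\mu_l}$ so that the maximum
can be checked numerically; the function names and the example matrix
are hypothetical.

```python
import math


def update_mu(phi):
    """M-step update: mu_l = (1/N) * sum_i phi_il for every attribute l."""
    n = len(phi)
    num_attrs = len(phi[0])
    return [sum(row[l] for row in phi) / n for l in range(num_attrs)]


def attr_objective(phi, l, mu_l):
    """L_{mu_l} = sum_i [phi_il log mu_l + (1 - phi_il) log(1 - mu_l)]."""
    return sum(
        row[l] * math.log(mu_l) + (1.0 - row[l]) * math.log(1.0 - mu_l)
        for row in phi
    )


# Hypothetical variational parameters for N = 3 nodes and L = 2 attributes.
phi = [[0.2, 0.9], [0.4, 0.5], [0.9, 0.1]]
mu = update_mu(phi)  # both column means are 0.5 (up to rounding)
```

Evaluating `attr_objective` slightly to either side of the returned
value gives a smaller objective, which is consistent with the
first-order condition derived above.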

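Second, to maximize $\mathcal{L}_Q(\mu, \Theta)$ with respect to
$\Theta_l$, we maximize
$\mathcal{L}_\Theta = \mathbb{E}_Q[\log P(A, F \mid \mu, \Theta) - \log Q(F)]$.
We first obtain the gradient

$$\nabla_{\Theta_l} \mathcal{L}_\Theta = \sum_{i,j} \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}}\left[ \log P(A_{ij} \mid F_i, F_j, \Theta) \right] \tag{10}$$

and then use a gradient-based method to optimize
$\mathcal{L}_Q(\mu, \Theta)$ with regard to $\Theta_l$. Algorithm 2
gives details for optimizing $\mathcal{L}_Q(\mu, \Theta)$ over $\mu$
and $\Theta$.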


Speeding up MagFit. So far we described how to apply the variational
EM algorithm to MAG model parameter estimation. However, both the
E-step and the M-step are infeasible when the number of nodes $N$ is
large. In particular, in the E-step, for each update of $\phi_{il}$,
we have to compute the expected log-likelihood value of every entry in
the $i$-th row and column of the adjacency matrix $A$. It takes
$O(LN)$ time to do this, so overall $O(L^2 N^2)$ time is needed to
update all $\phi_{il}$. Similarly, in the M-step, we need to sum up
the gradient of $\Theta_l$ over every pair of nodes (as in Eq. (10)).
Therefore, the M-step requires $O(LN^2)$ time and so it takes
$O(L^2 N^2)$ to run a single iteration of EM. Quadratic dependency in
the number of attributes $L$ and the number of nodes $N$ is infeasible
for the size of the networks that we aim to work with here.


To tackle this, we make the following observation. Note that both
Eq. (10) and the computation of
$\partial \mathcal{L}_Q / \partial \phi_{il}$ involve the sum of
expected values of the log-likelihood or the gradient. If we can
quickly approximate this sum of the expectations, we can dramatically
reduce the computation time. As real-world networks are sparse, in the
sense that most possible edges do not exist in the network, we can
break the summation into two parts: a fixed part that “pretends” that
the network has no edges and an adjustment part that takes into
account the edges that actually exist in the network.

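For example, in the M-step we can separate Eq. (10) into two parts,
the first term that considers an empty graph and the second term that
accounts for the edges that actually occurred in the network:

$$\begin{aligned}
\nabla_{\Theta_l} \mathcal{L}_\Theta = {} & \sum_{i,j} \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}}\left[ \log P(0 \mid F_i, F_j, \Theta) \right] \\
& + \sum_{A_{ij} = 1} \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}}\left[ \log P(1 \mid F_i, F_j, \Theta) - \log P(0 \mid F_i, F_j, \Theta) \right].
\end{aligned} \tag{11}$$

Now we approximate the first term that computes the gradient
pretending that the graph $A$ has no edges:

$$\begin{aligned}
\sum_{i,j} \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}}\left[ \log P(0 \mid F_i, F_j, \Theta) \right]
&= \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}}\Big[ \sum_{i,j} \log P(0 \mid F_i, F_j, \Theta) \Big] \\
&\approx \nabla_{\Theta_l} \mathbb{E}_{Q_{i,j}}\left[ N(N-1)\, \mathbb{E}_F\left[ \log P(0 \mid F, \Theta) \right] \right] \\
&= \nabla_{\Theta_l} N(N-1)\, \mathbb{E}_F\left[ \log P(0 \mid F, \Theta) \right].
\end{aligned} \tag{12}$$

Since each $F_{il}$ follows the Bernoulli distribution with parameter
$\mu_l$, Eq. (12) can be computed in $O(L)$ time. As the second term
in Eq. (11) requires only $O(LE)$ time, the computation time of the
M-step is reduced from $O(LN^2)$ to $O(LE)$. Similarly we reduce the
computation time of the E-step from $O(L^2 N^2)$ to $O(L^2 E)$ (see
Appendix for details). Thus overall we reduce the computation time of
MagFit from $O(L^2 N^2)$ to $O(L^2 E)$.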
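The split into an "empty graph" term plus per-edge corrections is an
exact identity for the log-likelihood itself, which the following
Python sketch verifies on a toy instance. Note the hedge: here the
empty-graph term is still computed exactly over all pairs, only to
check the identity; the speed-up described above comes from replacing
that term with the approximation of Eq. (12). Function names and
values are hypothetical, and self-loops are assumed to be excluded.

```python
import math


def loglik_all_pairs(adj, p):
    """Direct sum of log P(A_ij) over all ordered pairs i != j."""
    n = len(adj)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += math.log(p[i][j]) if adj[i][j] else math.log(1.0 - p[i][j])
    return total


def loglik_empty_plus_edges(edges, p):
    """Empty-graph term plus one correction per existing edge."""
    n = len(p)
    # Term 1: pretend the graph has no edges at all.
    empty = sum(
        math.log(1.0 - p[i][j]) for i in range(n) for j in range(n) if i != j
    )
    # Term 2: for each real edge, swap log(1 - p) for log(p).
    correction = sum(math.log(p[i][j]) - math.log(1.0 - p[i][j]) for i, j in edges)
    return empty + correction


# Hypothetical edge probabilities and a sparse directed graph on 3 nodes.
p = [[0.5, 0.3, 0.1], [0.2, 0.5, 0.4], [0.6, 0.1, 0.5]]
adj = [[0, 1, 0], [0, 0, 0], [1, 0, 0]]
edges = [(i, j) for i in range(3) for j in range(3) if adj[i][j]]
dense_value = loglik_all_pairs(adj, p)
split_value = loglik_empty_plus_edges(edges, p)
```

Only the second term depends on the observed graph, and it touches
$E$ entries rather than $N^2$.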


4  Experiments 


Having introduced the MAG model estimation procedure MagFit, we now
turn our attention to evaluating the fitting procedure itself and the
ability of the MAG model to capture the connectivity structure of real
networks. Our experiments have three goals: (1) evaluate the success
of the MagFit parameter estimation procedure; (2) given a network,
infer both latent node attributes and the affinity matrices to
accurately model the network structure; (3) given a network where
nodes already have attributes, infer the affinity matrices. For each
experiment, we proceed by describing the experimental setup and
datasets.


Convergence of MagFit. First, we briefly evaluate the convergence of
the MagFit algorithm. For this experiment, we use synthetic MAG
networks with $N = 1024$ and $L = 4$. Figure 3(a) illustrates that the
objective function $\mathcal{L}_Q$, i.e., the lower bound of the
log-likelihood, converges with the number of EM iterations. As the
log-likelihood converges, the model parameters $\mu$ and $\Theta$ also
converge. Figure 3(b) shows convergence of $\mu_1, \ldots, \mu_4$,
while Fig. 3(c) shows the convergence of entries $\Theta_l[0, 0]$ for
$l = 1, \ldots, 4$. Generally, in 100 iterations of EM, we obtain
stable parameter estimates.


We also compare the runtime of the fast MagFit to the naive version
where we do not use speedups for the algorithm. Figure 3(d) shows the
runtime as a function of the number of nodes in the network. The
runtime of the naive algorithm scales quadratically, $O(N^2)$, while
the fast version runs in near-linear time. For example, on a
4,000-node network, the fast algorithm runs about 100 times faster
than the naive one.

Figure 3: Parameter convergence and scalability.

Based on these experiments, we conclude that the variational EM gives
robust parameter estimates. We note that the MagFit optimization
problem is non-convex; however, in practice we observe fast
convergence and good fits. Depending on the initialization, MagFit may
converge to different solutions, but in practice solutions tend to
have comparable log-likelihoods and consistently good fits. Also, the
method scales to networks with up to a hundred thousand nodes.

Experiments on real data. We proceed with experiments on real
datasets. We use the LinkedIn social network [8] at the time in its
evolution when it had $N = 4{,}096$ nodes and $E = 10{,}052$ edges. We
also use the Yahoo!-Answers question answering social network, again
from the time when the network had $N = 4{,}096$ nodes and
$E = 5{,}678$ edges. For our experiments we choose $L = 11$, which is
roughly $\log N$, as it has been shown that this is the optimal choice
for $L$ [5].

Now we proceed as follows. Given a real network $A$, we apply MagFit
to estimate MAG model parameters $\Theta$ and $\mu$. Then, given these
parameters, we generate a synthetic network $\hat{A}$ and compare how
well the synthetic $\hat{A}$ mimics the real network $A$.

Evaluation. To measure the level of agreement between the synthetic
$\hat{A}$ and the real $A$, we use several different metrics. First,
we evaluate how well $\hat{A}$ captures the structural properties,
like degree distribution and clustering coefficient, of the real
network $A$. We consider the following network properties:

•  In/Out-degree distribution (InD/OutD) is a histogram of the number
of in-coming and out-going links of a node.

•  Singular values (SVal) indicate the singular values of the
adjacency matrix versus their rank.

•  Singular vector (SVec) represents the distribution of components in
the left singular vector associated with the largest singular value.

•  Clustering coefficient (CCF) represents the degree versus the
average (local) clustering coefficient of nodes of a given degree
[15].

•  Triad participation (TP) indicates the number of triangles that a
node is adjacent to. It measures the transitivity in networks.

Since distributions of the above quantities are generally
heavy-tailed, we plot them in terms of complementary cumulative
distribution functions ($P(X > x)$ as a function of $x$). Also, to
indicate the scale, we do not normalize the distributions to sum to 1.
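An unnormalized complementary cumulative distribution of this kind can
be computed with a single pass over the sorted distinct values. The
sketch below is one possible reading of the description (counts of
observations strictly greater than $x$, left unnormalized); the
function name `ccdf_counts` and the sample data are hypothetical.

```python
from collections import Counter


def ccdf_counts(values):
    """Return sorted (x, count of observations strictly greater than x)."""
    counts = Counter(values)
    remaining = len(values)
    result = []
    for x in sorted(counts):
        remaining -= counts[x]  # drop everything equal to x
        result.append((x, remaining))
    return result


# Example: a small list of node degrees.
degrees = [1, 1, 2, 3, 3, 3]
ccdf = ccdf_counts(degrees)  # [(1, 4), (2, 3), (3, 0)]
```

The last point always has count 0 and therefore cannot be drawn on a
logarithmic axis; a plotting routine would typically drop it.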

Second, to quantify the discrepancy of network properties between real
and synthetic networks, we use a variant of the Kolmogorov-Smirnov
(KS) statistic and the L2 distance between different distributions.
The original KS statistic is not appropriate here since, if the
distribution follows a power law, the original KS statistic is usually
dominated by the head of the distribution. We thus consider the
following variant of the KS statistic:
$KS(D_1, D_2) = \max_x |\log D_1(x) - \log D_2(x)|$, where $D_1$ and
$D_2$ are two complementary cumulative distribution functions.
Similarly, we also define a variant of the L2 distance on the log-log
scale,

$$L2(D_1, D_2) = \sqrt{ \frac{1}{\log b - \log a} \int_a^b \left( \log D_1(x) - \log D_2(x) \right)^2 d(\log x) },$$

where $[a, b]$ is the support of distributions $D_1$ and $D_2$.
Therefore, we evaluate the performance with regard to the recovery of
the network properties in terms of the KS and L2 statistics.
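A minimal Python sketch of the two log-scale statistics follows. It
assumes each distribution is given as a dictionary from $x$ to a
positive complementary cumulative value, compares only the $x$ values
present in both, requires at least two such points for the L2 variant,
and approximates the integral over $\log x$ with the trapezoid rule.
These choices, the square root in the L2 variant, and the function
names are assumptions of this illustration rather than details
confirmed by the text.

```python
import math


def ks_log(d1, d2):
    """max over shared x of |log D1(x) - log D2(x)|."""
    xs = [x for x in d1 if x in d2 and d1[x] > 0 and d2[x] > 0]
    return max(abs(math.log(d1[x]) - math.log(d2[x])) for x in xs)


def l2_log(d1, d2):
    """Root mean squared log-difference, integrated over log x (trapezoid rule)."""
    xs = sorted(x for x in d1 if x in d2 and x > 0 and d1[x] > 0 and d2[x] > 0)
    sq = [(math.log(d1[x]) - math.log(d2[x])) ** 2 for x in xs]
    integral = 0.0
    for k in range(len(xs) - 1):
        width = math.log(xs[k + 1]) - math.log(xs[k])
        integral += 0.5 * (sq[k] + sq[k + 1]) * width
    return math.sqrt(integral / (math.log(xs[-1]) - math.log(xs[0])))


# Two hypothetical CCDFs that differ by a constant factor of 2.
d_real = {1: 100, 10: 10, 100: 1}
d_synth = {1: 200, 10: 20, 100: 2}
ks_value = ks_log(d_real, d_synth)
l2_value = l2_log(d_real, d_synth)
```

Because the two example curves differ by a constant factor, both
statistics reduce to $\log 2$, which makes the example easy to check
by hand.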

Last, since MAG generates a probabilistic adjacency matrix $P$, we also evaluate how well $P$ represents a given network $A$. We use the following two metrics:

• Log-likelihood (LL) measures the likelihood that the probabilistic adjacency matrix $P$ generates network $A$: $LL = \sum_{ij} \log\left( P_{ij}^{A_{ij}} (1 - P_{ij})^{1 - A_{ij}} \right)$.

• True Positive Rate Improvement (TPI) represents the improvement of the true positive rate over a random graph: $TPI = \sum_{A_{ij}=1} P_{ij} \,/\, (E^2/N^2)$. TPI indicates how much more probability mass is put on the edges compared to a random graph (where each edge occurs with probability $E/N^2$).
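Both metrics can be computed directly from $P$ and $A$. The sketch below is illustrative only: the function names are hypothetical, and it assumes dense NumPy arrays, sums over all $N^2$ ordered pairs (including the diagonal), and entries of $P$ strictly between 0 and 1.

```python
import numpy as np

def log_likelihood(P, A):
    """LL = sum_ij log(P_ij^A_ij * (1 - P_ij)^(1 - A_ij))."""
    P = np.asarray(P, dtype=float)
    A = np.asarray(A)
    return float(np.sum(np.where(A == 1, np.log(P), np.log1p(-P))))

def tpi(P, A):
    """Probability mass placed on true edges, relative to a random graph.

    A random graph puts probability E / N^2 on each of the E edges,
    i.e. a total mass of E^2 / N^2.
    """
    P = np.asarray(P, dtype=float)
    A = np.asarray(A)
    n_nodes = A.shape[0]
    n_edges = int(A.sum())
    return float(P[A == 1].sum() / (n_edges ** 2 / n_nodes ** 2))
```

By construction, a constant matrix with $P_{ij} = E/N^2$ gives a TPI of exactly 1.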

Recovery of the network structure. We begin our investigations of real networks by comparing the performance of the MAG model to that of the Kronecker graphs model [5], which offers a state-of-the-art baseline for modeling the structure of large networks. We use the evaluation methods described in the previous section: we fit both models to a given real-world network $A$ and generate synthetic networks $A_{MAG}$ and $A_{Kron}$. Then we compute the structural properties of all three networks and plot them in Figure 4. Moreover, for each of the properties we also compute the KS and L2 statistics and show them in Table 1.

Figure 4: The recovered network properties by the MAG model and the Kronecker graphs model on the LinkedIn network. For every network property, MAG model outperforms the Kronecker graphs model.

Figure 4 plots the six network properties described above for the LinkedIn network and the synthetic networks generated by fitting MAG and Kronecker models to the LinkedIn network. We observe that MAG can successfully produce synthetic networks that match the properties of the real network. In particular, both MAG and Kronecker graphs models capture the degree distribution of the LinkedIn network well. However, the MAG model performs much better in matching the spectral properties of the graph adjacency matrix as well as the local clustering of the edges in the network.

Table 1 shows the KS and L2 statistics for each of the six structural properties plotted in Figure 4. Results confirm our previous visual inspection. The MAG model is able to fit the network structure much better than the Kronecker graphs model. In terms of the average KS statistic, we observe a 43% improvement, and an even greater improvement of 70% in the L2 metric. For degree distributions and the singular values, MAG outperforms Kronecker by about 25%, while the improvement on singular vector, triad participation and clustering coefficient is 60-75%.

Table 1: KS and L2 of MAG and the Kronecker graphs model on the LinkedIn network. MAG exhibits 50-70% better performance than Kronecker graphs model.

KS     InD    OutD   SVal   SVec   TP     CCF    Avg
MAG    3.70   3.80   0.84   2.43   3.87   3.16   2.97

4.00

6.14

Table 2: LL and TPI values for LinkedIn (LI) and Yahoo!-Answers (YA) networks.

We make similar observations on the Yahoo!-Answers network but omit the results for brevity. We include them in the Appendix.

We interpret the improvement of the MAG over the Kronecker graphs model in the following way. Intuitively, we can think of the Kronecker graphs model as a version of the MAG model where all affinity matrices $\Theta_i$ are the same and all $\mu_i = 0.5$. However, real-world networks may include various types of structures and thus different attributes may interact in different ways. For example, Figure 5 shows three possible linking affinities of a binary attribute. Figure 5(a) shows a homophily (love of the same) attribute affinity and the corresponding affinity matrix $\Theta$. Notice the large values on the diagonal entries of $\Theta$, which mean that link probability is high when nodes share the same attribute value. The top of each figure demonstrates that there will be many links between nodes that have the value of the attribute set to “0” and many links between nodes that have the value “1”, but there will be few links between nodes where one has value “0” and the other “1”. Similarly, Figure 5(b) shows a heterophily (love of the different) affinity, where nodes that do not share the value of the attribute are more likely to link, which gives rise to near-bipartite networks. Last, Figure 5(c) shows a core-periphery affinity, where links are most likely to form between “0” nodes (i.e., members of the core) and least likely to form between “1” nodes (i.e., members of the periphery). Notice that links between the core and the periphery are more likely than the links between the nodes of the periphery.

Figure 5: Structures in which a node attribute can affect link affinity: (a) Homophily, (b) Heterophily, (c) Core-Periphery. The widths of arrows correspond to the affinities towards link formation.

Turning our attention back to MAG and Kronecker models, we note that real-world networks globally exhibit nested core-periphery structure [9] (Figure 5(c)). While there exists the core (densely connected) and the periphery (sparsely connected) part of the network, there is another level of core-periphery structure inside the core itself. On the other hand, when viewing the network more finely, we may also observe homophily, which produces local community structure. MAG can model both global core-periphery structure and local homophily communities, while the Kronecker graphs model cannot express the different affinity types because it uses only one initiator matrix.

For example, the LinkedIn network consists of 4 core-periphery affinities, 6 homophily affinities, and 1 heterophily affinity matrix. Core-periphery affinity models active users who are more likely to connect to others. Homophily affinities model people who are more likely to connect to others in the same job area. Interestingly, there is a heterophily affinity which results in a bipartite relationship. We believe that the relationships between job seekers and recruiters or between employers and employees lead to this structure.

TPI and LL. We also compare the LL and TPI values of MAG and Kronecker models on both LinkedIn and Yahoo!-Answers networks. Table 2 shows that MAG outperforms Kronecker graphs by a surprisingly large margin. In the LL metric, the MAG model shows a 50-60% improvement over the Kronecker model. Furthermore, in the TPI metric, the MAG model shows 23-35 times better accuracy than the Kronecker model. From these results, we conclude that the MAG model achieves a superior probabilistic representation of a given network.

Case Study: AddHealth network. So far we considered node attributes as latent and we inferred the affinity matrices $\Theta$ as well as the attributes themselves. Now, we consider the setting where the node attributes are already given and we only need to infer the affinities $\Theta$. Our goal here is to study how real attributes explain the underlying network structure.

We use the largest high-school friendship network (N = 457, E = 2,259) from the National Longitudinal Study of Adolescent Health (AddHealth) dataset. The dataset includes more than 70 school-related attributes for each student. Since some attributes do not take binary values, we binarize them by taking value 1 if the value of the attribute is less than the median value. Now we aim to investigate which attributes affect the friendship formation and how.

Figure 6: Properties of the AddHealth network.


We set $L = 7$ and consider the following methods for selecting a subset of 7 attributes:

• R7: Randomly choose 7 real attributes and fit the model (i.e., only fit $\Theta$ as attributes are given).

• L7: Regard all 7 attributes as latent (i.e., not given) and estimate $\mu_l$ and $\Theta_l$ for $l = 1, \ldots, 7$.

• F7: Forward selection. Select attributes one by one. At each step select an additional attribute that maximizes the overall log-likelihood (i.e., select a real attribute and estimate its $\Theta_l$).

• F5+L2: Select 5 real attributes using forward selection. Then, we infer 2 more latent attributes.
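The forward selection used by F7 and F5+L2 is a greedy loop, which can be sketched as below. The `score` callback is a hypothetical stand-in for "fit the affinity matrices for this attribute set and return the overall log-likelihood"; it is not part of MagFit as described here, and the function name is an assumption.

```python
from typing import Callable, List, Sequence

def forward_select(candidates: Sequence[str], k: int,
                   score: Callable[[List[str]], float]) -> List[str]:
    """Greedily pick k attributes, adding the one with the best score each step."""
    selected: List[str] = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        # Evaluate every remaining attribute together with those already chosen.
        best = max(remaining, key=lambda attr: score(selected + [attr]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each step costs one model fit per remaining candidate, so selecting 7 out of roughly 70 attributes needs a few hundred fits in this naive form.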

To make MagFit work with fixed real attributes (i.e., only infer $\Theta$) we fix $\phi_{il}$ to the values of the real attributes. In the E-step we then optimize only over the latent set of $\phi_{il}$ and the M-step remains as is.

AddHealth network structure. We begin by evaluating the recovery of the network structure. Figure 6 shows the recovery of six network properties for each attribute selection method. We note that each method manages to recover degree distributions as well as spectral properties (singular values and singular vectors) but the performance is different for clustering coefficient and triad participation.

Table 3 shows the discrepancies in the 6 network properties (KS and L2 statistics) for each attribute selection method. As expected, selecting 7 real attributes at random (R7) performs the worst. Naturally, L7 performs the best (23% improvement over R7 in KS and 50% in L2) as it has the most degrees of freedom. It is followed by F5+L2 (the combination of 5 real and 2 latent attributes) and F7 (forward selection).

Table 3: Performance of different selection methods.

KS      InD    OutD   SVal   SVec   TP     CCF    Avg
R7      1.00   0.58   0.48   2.92   4.52   4.45   2.32
F7      2.32   2.80   0.30   2.68   2.60   1.58   2.05
F5+L2   3.45   4.00   0.26   0.95   1.30   3.45   2.24
L7      1.58   1.58   0.18   2.00   2.67   2.66   1.78

L2      InD    OutD   SVal   SVec   TP     CCF    Avg
R7      0.25   0.16   0.25   0.96   3.18   1.74   1.09
F7      0.71   0.67   0.18   0.98   1.26   0.78   0.76
F5+L2   0.80   0.87   0.13   0.34   0.76   1.30   0.70
L7      0.29   0.27   0.10   0.64   0.75   1.22   0.54

Table 4: LL and TPI for the AddHealth network.

        R7       F7       F5+L2    L7
LL      -13651   -12161   -12047   -9154
TPI     1.0      1.1      1.9      10.0

As a point of comparison we also experimented with a simple logistic regression classifier where, given the attributes of a pair of nodes, we aim to predict the occurrence of an edge. Given network $A$ on $N$ nodes, we have $N^2$ training examples (one for each pair of nodes): $E$ are positive (edges) and $N^2 - E$ are negative (non-edges). However, the model performs poorly as it gives 50% worse KS statistics than MAG. The average KS of logistic regression under R7 is 3.24 (vs. 2.32 of MAG) and the same statistic under F7 is 3.00 (vs. 2.05 of MAG). Similarly, logistic regression gives 40% worse L2 under R7 and 50% worse L2 under F7. These results demonstrate that, using the same attributes, MAG heavily outperforms logistic regression. We believe this performance difference arises because the connectivity between a pair of nodes depends on factors other than a linear combination of their attribute values.

Last, we also examine the LL and TPI values and compare them to the random attribute selection R7 as a baseline. Table 4 gives the results. Somewhat contrary to our previous observations, we note that F7 only slightly outperforms R7, while F5+L2 gives a factor of 2 better TPI than R7. Again, L7 gives a factor of 10 improvement in TPI and the overall best performance.

Attribute affinities. Last, we investigate the structure of attribute affinity matrices to illustrate how the MAG model can be used to understand the way real attributes interact in shaping the network structure. We use forward selection (F7) to select 7 real attributes and estimate their affinity matrices. Table 5 reports the first 5 attributes selected by the forward selection.


Table 5: Affinity matrices of 5 AddHealth attributes.

Affinity matrix               Attribute description
[0.572 0.146; 0.146 0.999]    School year (0 if > 2)
[0.845 0.332; 0.332 0.816]    Highest level math (0 if > 6)
[0.788 0.377; 0.377 0.784]    Cumulative GPA (0 if > 2.65)
[0.999 0.246; 0.246 0.352]    AP/IB English (0 if taken)
[0.794 0.407; 0.407 0.717]    Foreign language (0 if taken)

First, notice that the AddHealth network is an undirected graph and that the estimated affinity matrices are all symmetric. This means that, without a priori biasing the fitting towards undirected graphs, the recovered parameters obey this structure. Second, we also observe that every attribute forms a homophily structure in the sense that each student is more likely to be friends with other students of the same characteristic. For example, people are more likely to make friends within the same school year. Interestingly, students who are freshmen or sophomores are more likely (0.99) to form links among themselves than juniors and seniors (0.57). Also notice that the level of advanced courses that each student takes as well as the GPA affect the formation of friendship ties. Since it is difficult for students to interact if they do not take the same courses, the chance of friendship may be low. We note that, for example, students who take advanced placement (AP) English courses are very likely to form links. However, links between students who did not take AP English are nearly as likely as links between AP and non-AP students. Last, we also observe a relatively small effect of the number of foreign language courses taken on friendship formation.
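The reading of Table 5 can be reproduced mechanically. The sketch below labels a $2 \times 2$ affinity matrix by comparing its diagonal and off-diagonal entries, and multiplies per-attribute affinities into an edge probability as the MAG model prescribes. The function names and the exact comparison rules are assumptions made for illustration; the paper describes the three structure types only qualitatively.

```python
import numpy as np

def affinity_type(theta):
    """Label a 2x2 affinity matrix by its diagonal/off-diagonal pattern."""
    t = np.asarray(theta, dtype=float)
    diag = (t[0, 0], t[1, 1])
    off = (t[0, 1], t[1, 0])
    if min(diag) > max(off):
        return "homophily"        # same-value pairs link most
    if max(diag) < min(off):
        return "heterophily"      # different-value pairs link most
    if max(diag) > max(off) and min(off) > min(diag):
        return "core-periphery"   # core-core > core-periphery > periphery-periphery
    return "other"

def edge_probability(thetas, f_i, f_j):
    """MAG link probability: product over attributes of Theta_l[F_il, F_jl]."""
    return float(np.prod([np.asarray(t)[a, b] for t, a, b in zip(thetas, f_i, f_j)]))

# Affinity matrices copied from Table 5.
TABLE5 = {
    "School year": [[0.572, 0.146], [0.146, 0.999]],
    "Highest level math": [[0.845, 0.332], [0.332, 0.816]],
    "Cumulative GPA": [[0.788, 0.377], [0.377, 0.784]],
    "AP/IB English": [[0.999, 0.246], [0.246, 0.352]],
    "Foreign language": [[0.794, 0.407], [0.407, 0.717]],
}
```

Under these rules all five matrices of Table 5 are labeled as homophily, in line with the observation above.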

5  Conclusion 

We  developed  MagFit,  a  scalable  variational  expec¬ 
tation  maximization  method  for  parameter  estimation 
of  the  Multiplicative  Attribute  Graph  model.  The 
model  naturally  captures  interactions  between  node 
attributes  and  the  network  structure.  MAG  model 
considers  nodes  with  categorical  attributes  and  the 
probability  of  an  edge  between  a  pair  of  nodes  depends 
on  the  product  of  individual  attribute  link  formation 
affinities.  Experiments  show  that  MAG  reliably  cap¬ 
tures  the  network  connectivity  patterns  as  well  as  pro¬ 
vides  insights  into  how  different  attributes  shape  the 
structure of networks. Avenues for future work include
settings  where  node  attributes  are  partially  missing 
and  investigations  of  other  ways  to  combine  individ¬ 
ual  attribute  linking  affinities  into  a  link  probability. 
A Variational EM Algorithm


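In Section 2, we proposed a version of the MAG model by introducing a generative Bernoulli model for node attributes and formulated the problem to solve. In the following Section 3, we gave a sketch of MagFit, which uses the variational EM algorithm to solve the problem. Here we show how to compute the gradients of the model parameters ($\phi$, $\mu$, and $\Theta$) for the E-step and M-step that we omitted in Section 3. We also give the details of the fast MagFit in the following.

A.1 Variational E-Step

In the E-step, the MAG model parameters $\mu$ and $\Theta$ are given and we aim to find the optimal variational parameter $\phi$ that maximizes $\mathcal{L}_Q(\mu, \Theta)$ as well as minimizes the mutual information factor $MI(P)$. We randomly select a batch of entries in $\phi$ and update the selected entries by their gradient values of the objective function $\mathcal{L}_Q(\mu, \Theta)$. We repeat this updating procedure until $\phi$ converges.

In order to obtain $\nabla_\phi \left( \mathcal{L}_Q(\mu, \Theta) - \lambda MI(P) \right)$, we compute $\frac{\partial \mathcal{L}_Q}{\partial \phi_{il}}$ and $\frac{\partial MI}{\partial \phi_{il}}$ in turn as follows.

Computation of $\frac{\partial \mathcal{L}_Q}{\partial \phi_{il}}$. To calculate the partial derivative $\frac{\partial \mathcal{L}_Q}{\partial \phi_{il}}$, we begin by restating $\mathcal{L}_Q(\mu, \Theta)$ as a function of one specific parameter $\phi_{il}$ and differentiate this function over $\phi_{il}$. For convenience, we denote $F_{-il} = \{F_{jk} : j \neq i, k \neq l\}$ and $Q_{-il} = \prod_{j \neq i, k \neq l} Q_{jk}$.

Note that $\sum_{F_{il}} Q_{il}(F_{il}) = 1$ and $\sum_{F_{-il}} Q_{-il}(F_{-il}) = 1$ because both are the sums of probabilities of all possible events. Therefore, we can separate $\mathcal{L}_Q(\mu, \Theta)$ into the terms of $Q_{il}(F_{il})$ and $Q_{-il}(F_{-il})$:

$$\begin{aligned}
\mathcal{L}_Q(\mu, \Theta)
&= E_Q\left[ \log P(A, F | \mu, \Theta) - \log Q(F) \right] \\
&= \sum_F Q(F) \left( \log P(A, F | \mu, \Theta) - \log Q(F) \right) \\
&= \sum_{F_{-il}} \sum_{F_{il}} Q_{-il}(F_{-il}) Q_{il}(F_{il}) \left( \log P(A, F | \mu, \Theta) - \log Q_{il}(F_{il}) - \log Q_{-il}(F_{-il}) \right) \\
&= \sum_{F_{il}} Q_{il}(F_{il}) \left( \sum_{F_{-il}} Q_{-il}(F_{-il}) \log P(A, F | \mu, \Theta) \right) \\
&\quad - \sum_{F_{il}} Q_{il}(F_{il}) \log Q_{il}(F_{il}) - \sum_{F_{-il}} Q_{-il}(F_{-il}) \log Q_{-il}(F_{-il}) \\
&= \sum_{F_{il}} Q_{il}(F_{il}) E_{Q_{-il}}\left[ \log P(A, F | \mu, \Theta) \right] + \mathcal{H}(Q_{il}) + \mathcal{H}(Q_{-il})
\end{aligned} \tag{13}$$

where $\mathcal{H}(P)$ represents the entropy of distribution $P$.

Since we compute the gradient of $\phi_{il}$, we regard the other variational parameters $\phi_{-il}$ as constants, so $\mathcal{H}(Q_{-il})$ is also a constant. Moreover, as $E_{Q_{-il}}[\log P(A, F | \mu, \Theta)]$ integrates out all the terms with regard to $\phi_{-il}$, it is a function of $F_{il}$. Thus, for convenience, we denote $E_{Q_{-il}}[\log P(A, F | \mu, \Theta)]$ as $\log \tilde{P}_{il}(F_{il})$. Then, since $F_{il}$ follows a Bernoulli distribution with parameter $\phi_{il}$, by Eq. (13)

$$\mathcal{L}_Q(\mu, \Theta) = (1 - \phi_{il}) \left( \log \tilde{P}_{il}(1) - \log(1 - \phi_{il}) \right) + \phi_{il} \left( \log \tilde{P}_{il}(0) - \log \phi_{il} \right) + \text{const}. \tag{14}$$

Note that both $\tilde{P}_{il}(0)$ and $\tilde{P}_{il}(1)$ are constant. Therefore,

$$\frac{\partial \mathcal{L}_Q}{\partial \phi_{il}} = \log \frac{\tilde{P}_{il}(0)}{\phi_{il}} - \log \frac{\tilde{P}_{il}(1)}{1 - \phi_{il}}. \tag{15}$$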
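The step from Eq. (14) to Eq. (15) can be checked numerically. The sketch below codes the $\phi_{il}$-dependent part of the objective and its derivative, treating $\log \tilde{P}_{il}(0)$ and $\log \tilde{P}_{il}(1)$ as given constants. The function names are hypothetical, and the closed-form stationary point applies only to $\mathcal{L}_Q$ on its own, without the mutual information penalty.

```python
import math

def objective(phi, log_p0, log_p1):
    """phi-dependent part of Eq. (14); the additive constant is dropped."""
    return ((1.0 - phi) * (log_p1 - math.log(1.0 - phi))
            + phi * (log_p0 - math.log(phi)))

def gradient(phi, log_p0, log_p1):
    """Eq. (15): log(P0 / phi) - log(P1 / (1 - phi))."""
    return (log_p0 - math.log(phi)) - (log_p1 - math.log(1.0 - phi))

def stationary_phi(log_p0, log_p1):
    """Zero of Eq. (15): phi / (1 - phi) = P0 / P1."""
    return 1.0 / (1.0 + math.exp(log_p1 - log_p0))
```

A central finite difference of `objective` should agree with `gradient`, and `gradient` should vanish at `stationary_phi`.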


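To complete the computation of $\frac{\partial \mathcal{L}_Q}{\partial \phi_{il}}$, we now focus on the value of $\tilde{P}_{il}(F_{il})$ for $F_{il} = 0, 1$. By Eq. (4) and the linearity of expectation, $\log \tilde{P}_{il}(F_{il})$ is separable into small tractable terms as follows:

$$\begin{aligned}
\log \tilde{P}_{il}(F_{il}) &= E_{Q_{-il}}\left[ \log P(A, F | \mu, \Theta) \right] \\
&= \sum_{u,v} E_{Q_{-il}}\left[ \log P(A_{uv} | F_u, F_v, \Theta) \right] + \sum_{u,k} E_{Q_{-il}}\left[ \log P(F_{uk} | \mu_k) \right]
\end{aligned} \tag{16}$$

where $F_i = \{F_{il} : l = 1, 2, \cdots, L\}$. However, if $u, v \neq i$, then $E_{Q_{-il}}[\log P(A_{uv} | F_u, F_v, \Theta)]$ is a constant, because the average over $Q_{-il}(F_{-il})$ integrates out all the variables $F_u$ and $F_v$. Similarly, if $u \neq i$ and $k \neq l$, then $E_{Q_{-il}}[\log P(F_{uk} | \mu_k)]$ is a constant. Since most of the terms in Eq. (16) are irrelevant to $\phi_{il}$, $\log \tilde{P}_{il}(F_{il})$ is simplified as

$$\log \tilde{P}_{il}(F_{il}) = \sum_j E_{Q_{-il}}\left[ \log P(A_{ij} | F_i, F_j, \Theta) \right] + \sum_j E_{Q_{-il}}\left[ \log P(A_{ji} | F_j, F_i, \Theta) \right] + \log P(F_{il} | \mu_l) + C \tag{17}$$

for some constant $C$.

By definition of $P(F_{il} | \mu_l)$, the last term in Eq. (17) is

$$\log P(F_{il} | \mu_l) = F_{il} \log \mu_l + (1 - F_{il}) \log(1 - \mu_l). \tag{18}$$

With regard to the first two terms in Eq. (17),

$$\log P(A_{ij} | F_i, F_j, \Theta) = \log P(A_{ji} | F_i, F_j, \Theta^T).$$

Hence, the methods to compute the two terms are equivalent. Thus, we now focus on the computation of $E_{Q_{-il}}[\log P(A_{ij} | F_i, F_j, \Theta)]$.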


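First, in case of $A_{ij} = 1$, by definition of $P(A_{ij} | F_i, F_j)$,

$$\begin{aligned}
E_{Q_{-il}}\left[ \log P(A_{ij} = 1 | F_i, F_j, \Theta) \right]
&= E_{Q_{-il}}\left[ \sum_k \log \Theta_k[F_{ik}, F_{jk}] \right] \\
&= E_{Q_{jl}}\left[ \log \Theta_l[F_{il}, F_{jl}] \right] + \sum_{k \neq l} E_{Q_{ik,jk}}\left[ \log \Theta_k[F_{ik}, F_{jk}] \right] \\
&= E_{Q_{jl}}\left[ \log \Theta_l[F_{il}, F_{jl}] \right] + C'
\end{aligned} \tag{19}$$

for some constant $C'$ where $Q_{ik,jk}(F_{ik}, F_{jk}) = Q_{ik}(F_{ik}) Q_{jk}(F_{jk})$, because $E_{Q_{ik,jk}}[\log \Theta_k[F_{ik}, F_{jk}]]$ is constant for each $k$.

Second, in case of $A_{ij} = 0$,

$$P(A_{ij} = 0 | F_i, F_j, \Theta) = 1 - \prod_k \Theta_k[F_{ik}, F_{jk}]. \tag{20}$$

Since $\log P(A_{ij} | F_i, F_j, \Theta)$ is not separable in terms of $\Theta_k$, it takes $O(2^{2L})$ time to compute $E_{Q_{-il}}[\log P(A_{ij} | F_i, F_j, \Theta)]$ exactly. We can reduce this computation time to $O(L)$ by applying the Taylor expansion $\log(1 - x) \approx -x - \frac{1}{2} x^2$ for small $x$:

$$\begin{aligned}
E_{Q_{-il}}\left[ \log P(A_{ij} = 0 | F_i, F_j, \Theta) \right]
&\approx E_{Q_{-il}}\left[ -\prod_k \Theta_k[F_{ik}, F_{jk}] - \frac{1}{2} \prod_k \Theta_k^2[F_{ik}, F_{jk}] \right] \\
&= -E_{Q_{jl}}\left[ \Theta_l[F_{il}, F_{jl}] \right] \prod_{k \neq l} E_{Q_{ik,jk}}\left[ \Theta_k[F_{ik}, F_{jk}] \right] \\
&\quad - \frac{1}{2} E_{Q_{jl}}\left[ \Theta_l^2[F_{il}, F_{jl}] \right] \prod_{k \neq l} E_{Q_{ik,jk}}\left[ \Theta_k^2[F_{ik}, F_{jk}] \right]
\end{aligned} \tag{21}$$

where each term can be computed by

$$E_{Q_{jl}}\left[ Y_l[F_{il}, F_{jl}] \right] = \phi_{jl} Y_l[F_{il}, 0] + (1 - \phi_{jl}) Y_l[F_{il}, 1]$$

$$E_{Q_{ik,jk}}\left[ Y_k[F_{ik}, F_{jk}] \right] = [\phi_{ik} \;\; 1 - \phi_{ik}] \cdot Y_k \cdot [\phi_{jk} \;\; 1 - \phi_{jk}]^T$$

for any matrix $Y_l, Y_k \in \mathbb{R}^{2 \times 2}$.

In brief, for fixed $i$ and $l$, we first compute $E_{Q_{-il}}[\log P(A_{ij} | F_i, F_j, \Theta)]$ for each node $j$ depending on whether or not $i \to j$ is an edge. By adding $\log P(F_{il} | \mu_l)$, we then obtain the value of $\log \tilde{P}_{il}(F_{il})$ for each $F_{il}$. Once we have $\log \tilde{P}_{il}(F_{il})$, we can finally compute $\frac{\partial \mathcal{L}_Q}{\partial \phi_{il}}$.

Scalable computation. However, as we analyzed in Section 3, the above E-step algorithm requires $O(LN)$ time for each computation of $\frac{\partial \mathcal{L}_Q}{\partial \phi_{il}}$ so that the total computation time is $O(L^2 N^2)$, which is infeasible when the number of nodes $N$ is large.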

Here we propose a scalable algorithm for computing $\frac{\partial \mathcal{L}_Q}{\partial \phi_{il}}$ by further approximation. As described in Section 3, we quickly approximate the value of $\frac{\partial \mathcal{L}_Q}{\partial \phi_{il}}$ as if the network were empty, and adjust it by the part where edges actually exist. To approximate $\frac{\partial \mathcal{L}_Q}{\partial \phi_{il}}$ in the empty network case, we reformulate the first term in Eq. (17):

$$\sum_j E_{Q_{-il}}\left[ \log P(A_{ij} | F_i, F_j, \Theta) \right] = \sum_j E_{Q_{-il}}\left[ \log P(0 | F_i, F_j, \Theta) \right] + \sum_{A_{ij} = 1} E_{Q_{-il}}\left[ \log P(1 | F_i, F_j, \Theta) - \log P(0 | F_i, F_j, \Theta) \right] \tag{22}$$

Since the sum of i.i.d. random variables can be approximated in terms of the expectation of the random variable, the first term in Eq. (22) can be approximated as follows:

$$\begin{aligned}
\sum_j E_{Q_{-il}}\left[ \log P(0 | F_i, F_j, \Theta) \right]
&= E_{Q_{-il}}\left[ \sum_j \log P(0 | F_i, F_j, \Theta) \right] \\
&\approx E_{Q_{-il}}\left[ (N - 1) E_{F_j}\left[ \log P(0 | F_i, F_j, \Theta) \right] \right] \\
&= (N - 1) E_{F_j}\left[ \log P(0 | F_i, F_j, \Theta) \right]
\end{aligned} \tag{23}$$

As $F_{jl}$ marginally follows a Bernoulli distribution with $\mu_l$, we can compute Eq. (23) by using Eq. (21) in $O(L)$ time. Since the second term of Eq. (22) takes $O(L N_i)$ time where $N_i$ represents the number of neighbors of node $i$, Eq. (22) takes only $O(L N_i)$ time in total. As in the E-step we do this operation by iterating for all $i$'s and $l$'s, the total computation time of the E-step eventually becomes $O(L^2 E)$, which is feasible in many large-scale networks.
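The split in Eq. (22) is an exact identity before any approximation: the row sum equals an "empty graph" term over all nodes plus a correction over the actual neighbors. The sketch below illustrates this with known binary attributes instead of expectations under $Q$, which is a simplification made for brevity; all function names are hypothetical. In the scalable algorithm, `empty_term` would be replaced by the approximation of Eq. (23) rather than computed by a full pass.

```python
import numpy as np

def edge_prob(thetas, f_i, f_j):
    """MAG link probability: product over attributes of Theta_k[F_ik, F_jk]."""
    return float(np.prod([t[a, b] for t, a, b in zip(thetas, f_i, f_j)]))

def row_loglik_dense(thetas, F, A, i):
    """Direct sum of log P(A_ij | F_i, F_j, Theta) over all j != i: N - 1 terms."""
    total = 0.0
    for j in range(len(F)):
        if j == i:
            continue
        p = edge_prob(thetas, F[i], F[j])
        total += np.log(p) if A[i, j] else np.log1p(-p)
    return float(total)

def row_loglik_split(thetas, F, neighbors, i, empty_term):
    """Eq. (22): empty-graph term plus a correction over the neighbors of i only."""
    correction = 0.0
    for j in neighbors:
        p = edge_prob(thetas, F[i], F[j])
        correction += np.log(p) - np.log1p(-p)
    return float(empty_term + correction)
```

Only the correction loop depends on the edges, which is where the $O(L N_i)$ cost per node comes from.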

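Computation of $\frac{\partial MI}{\partial \phi_{il}}$. Now we turn our attention to the derivative of the mutual information term. Since $MI(P) = \sum_{l \neq l'} MI_{ll'}$, we can separately compute the derivative of each term $\frac{\partial MI_{ll'}}{\partial \phi_{il}}$. By definition and the chain rule,

$$\frac{\partial MI_{ll'}}{\partial \phi_{il}} = \sum_{x, y \in \{0, 1\}} \left[ \frac{\partial p_{ll'}(x, y)}{\partial \phi_{il}} \log \frac{p_{ll'}(x, y)}{p_l(x) p_{l'}(y)} + \frac{\partial p_{ll'}(x, y)}{\partial \phi_{il}} - \frac{p_{ll'}(x, y)}{p_l(x)} \frac{\partial p_l(x)}{\partial \phi_{il}} - \frac{p_{ll'}(x, y)}{p_{l'}(y)} \frac{\partial p_{l'}(y)}{\partial \phi_{il}} \right] \tag{24}$$

The values of $p_{ll'}(x, y)$, $p_l(x)$, and $p_{l'}(y)$ follow from their definitions. Therefore, in order to compute $\frac{\partial MI_{ll'}}{\partial \phi_{il}}$, we need the values of $\frac{\partial p_{ll'}(x, y)}{\partial \phi_{il}}$, $\frac{\partial p_l(x)}{\partial \phi_{il}}$, and $\frac{\partial p_{l'}(y)}{\partial \phi_{il}}$. By definition,

$$\frac{\partial p_{ll'}(x, y)}{\partial \phi_{il}} = Q_{il'}(y) \frac{\partial Q_{il}}{\partial \phi_{il}}, \qquad \frac{\partial p_l(x)}{\partial \phi_{il}} = \frac{\partial Q_{il}}{\partial \phi_{il}}, \qquad \frac{\partial p_{l'}(y)}{\partial \phi_{il}} = 0$$

where $\left. \frac{\partial Q_{il}}{\partial \phi_{il}} \right|_{F_{il} = 0} = 1$ and $\left. \frac{\partial Q_{il}}{\partial \phi_{il}} \right|_{F_{il} = 1} = -1$.

Since all terms in $\frac{\partial MI_{ll'}}{\partial \phi_{il}}$ are tractable, we can eventually compute $\frac{\partial MI}{\partial \phi_{il}}$.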

A.2 Variational M-Step

In the E-step, with given model parameters $\mu$ and $\Theta$, we updated the variational parameter $\phi$ to maximize $\mathcal{L}_Q(\mu, \Theta)$ as well as to minimize the mutual information between every pair of attributes. In the M-step, we fix the approximate posterior distribution $Q(F)$, i.e., fix the variational parameter $\phi$, and update the model parameters $\mu$ and $\Theta$ to maximize $\mathcal{L}_Q(\mu, \Theta)$.

To reformulate $\mathcal{L}_Q(\mu, \Theta)$ by Eq. (4),

$$\begin{aligned}
\mathcal{L}_Q(\mu, \Theta)
&= E_Q\left[ \log P(A, F | \mu, \Theta) - \log Q(F) \right] \\
&= E_Q\left[ \sum_{i,j} \log P(A_{ij} | F_i, F_j, \Theta) + \sum_{i,l} \log P(F_{il} | \mu_l) \right] + \mathcal{H}(Q) \\
&= \sum_{i,j} E_{Q_{i,j}}\left[ \log P(A_{ij} | F_i, F_j, \Theta) \right] + \sum_l \left( \sum_i E_{Q_{il}}\left[ \log P(F_{il} | \mu_l) \right] \right) + \mathcal{H}(Q)
\end{aligned} \tag{25}$$

where $Q_{i,j}(F_{\{i \cdot\}}, F_{\{j \cdot\}})$ represents $\prod_l Q_{il}(F_{il}) Q_{jl}(F_{jl})$.

In summary, $\mathcal{L}_Q(\mu, \Theta)$ in Eq. (25) is divided into the following terms: a function of $\Theta$, a function of $\mu_l$, and a constant. Thus, we can update $\mu$ and $\Theta$ separately. Since we already showed how to update $\mu$ in Section 3, here we focus on the maximization of $\mathcal{L}_\Theta = E_Q[\log P(A, F | \mu, \Theta) - \log Q(F)]$ using the gradient method.

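Computation of $\nabla_{\Theta_l} \mathcal{L}_\Theta$. To use the gradient method, we need to compute the gradient of $\mathcal{L}_\Theta$:

$$\nabla_{\Theta_l} \mathcal{L}_\Theta = \sum_{i,j} \nabla_{\Theta_l} E_{Q_{i,j}}\left[ \log P(A_{ij} | F_i, F_j, \Theta) \right]. \tag{26}$$

We separately calculate the gradient of each term in $\mathcal{L}_\Theta$ as follows: For every $z_1, z_2 \in \{0, 1\}$, if $A_{ij} = 1$,

$$\begin{aligned}
\frac{\partial E_{Q_{i,j}}\left[ \log P(A_{ij} | F_i, F_j, \Theta) \right]}{\partial \Theta_l[z_1, z_2]}
&= \frac{\partial}{\partial \Theta_l[z_1, z_2]} E_{Q_{i,j}}\left[ \sum_k \log \Theta_k[F_{ik}, F_{jk}] \right] \\
&= \frac{\partial}{\partial \Theta_l[z_1, z_2]} E_{Q_{i,j}}\left[ \log \Theta_l[F_{il}, F_{jl}] \right] \\
&= \frac{Q_{il}(z_1) Q_{jl}(z_2)}{\Theta_l[z_1, z_2]}.
\end{aligned} \tag{27}$$

On the contrary, if $A_{ij} = 0$, we use the Taylor expansion as used in Eq. (21):

$$\begin{aligned}
\frac{\partial E_{Q_{i,j}}\left[ \log P(A_{ij} | F_i, F_j, \Theta) \right]}{\partial \Theta_l[z_1, z_2]}
&\approx \frac{\partial}{\partial \Theta_l[z_1, z_2]} E_{Q_{i,j}}\left[ -\prod_k \Theta_k[F_{ik}, F_{jk}] - \frac{1}{2} \prod_k \Theta_k^2[F_{ik}, F_{jk}] \right] \\
&= -Q_{il}(z_1) Q_{jl}(z_2) \prod_{k \neq l} E_{Q_{ik,jk}}\left[ \Theta_k[F_{ik}, F_{jk}] \right] \\
&\quad - Q_{il}(z_1) Q_{jl}(z_2) \Theta_l[z_1, z_2] \prod_{k \neq l} E_{Q_{ik,jk}}\left[ \Theta_k^2[F_{ik}, F_{jk}] \right]
\end{aligned} \tag{28}$$

where $Q_{il,jl}(F_{il}, F_{jl}) = Q_{il}(F_{il}) Q_{jl}(F_{jl})$. Since

$$E_{Q_{ik,jk}}\left[ f(\Theta_k) \right] = \sum_{z_1, z_2} Q_{ik}(z_1) Q_{jk}(z_2) f\left( \Theta_k[z_1, z_2] \right)$$

for any function $f$ and we know each function value of $Q_{il}(F_{il})$ in terms of $\phi_{il}$, we are able to obtain the gradient $\nabla_{\Theta_l} \mathcal{L}_\Theta$ by Eq. (26)-(28).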
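The pairwise expectation identity above, combined with the second-order Taylor expansion of Eq. (21), is what turns an exponential enumeration into a product of $L$ small bilinear forms. The sketch below compares the two for a single non-edge. It is an illustration under stated assumptions: the function names are hypothetical, all attributes of both nodes are averaged under $Q$ (the M-step setting), and $\phi$ is taken to be the probability of value 0, following the bilinear form given in the E-step.

```python
import itertools
import numpy as np

def pair_expectation(Y, phi_i, phi_j):
    """E[Y[F_i, F_j]] as the bilinear form [phi, 1-phi] Y [phi, 1-phi]^T."""
    Y = np.asarray(Y, dtype=float)
    return float(np.array([phi_i, 1 - phi_i]) @ Y @ np.array([phi_j, 1 - phi_j]))

def taylor_expected_log_nonedge(thetas, phi_i, phi_j):
    """O(L) approximation of E[log(1 - prod_k Theta_k)] using log(1-x) ~ -x - x^2/2."""
    first = np.prod([pair_expectation(t, a, b)
                     for t, a, b in zip(thetas, phi_i, phi_j)])
    second = np.prod([pair_expectation(np.asarray(t, dtype=float) ** 2, a, b)
                      for t, a, b in zip(thetas, phi_i, phi_j)])
    return float(-first - 0.5 * second)

def exact_expected_log_nonedge(thetas, phi_i, phi_j):
    """Exact expectation by enumerating all 2^(2L) attribute assignments."""
    n_attrs = len(thetas)
    total = 0.0
    for f_i in itertools.product((0, 1), repeat=n_attrs):
        for f_j in itertools.product((0, 1), repeat=n_attrs):
            weight, prob = 1.0, 1.0
            for t, a, b, zi, zj in zip(thetas, phi_i, phi_j, f_i, f_j):
                weight *= (a if zi == 0 else 1 - a) * (b if zj == 0 else 1 - b)
                prob *= np.asarray(t, dtype=float)[zi, zj]
            total += weight * np.log1p(-prob)
    return float(total)
```

The approximation is accurate when the product of affinities is small, which is the typical regime for sparse networks; for link probabilities close to 1 the truncated expansion degrades.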

Scalable computation. The M-step requires summing $O(N^2)$ terms in Eq. (26), where each term takes $O(L)$ time to compute. Similarly to the E-step, here we propose a scalable algorithm by separating Eq. (26) into two parts, the fixed part for an empty graph and the adjustment part for the actual edges:

$$\nabla_{\Theta_l} \mathcal{L}_\Theta = \sum_{i,j} \nabla_{\Theta_l} E_{Q_{i,j}}\left[ \log P(0 | F_i, F_j, \Theta) \right] + \sum_{A_{ij} = 1} \nabla_{\Theta_l} E_{Q_{i,j}}\left[ \log P(1 | F_i, F_j, \Theta) - \log P(0 | F_i, F_j, \Theta) \right]. \tag{29}$$

We are able to approximate the first term in Eq. (29), the value for the empty graph part, as follows:

$$\begin{aligned}
\sum_{i,j} \nabla_{\Theta_l} E_{Q_{i,j}}\left[ \log P(0 | F_i, F_j, \Theta) \right]
&= \nabla_{\Theta_l} E_{Q_{i,j}}\left[ \sum_{i,j} \log P(0 | F_i, F_j, \Theta) \right] \\
&\approx \nabla_{\Theta_l} E_{Q_{i,j}}\left[ N(N - 1) E_F\left[ \log P(0 | F, \Theta) \right] \right] \\
&= \nabla_{\Theta_l} N(N - 1) E_F\left[ \log P(0 | F, \Theta) \right].
\end{aligned} \tag{30}$$

Since each $F_{il}$ marginally follows the Bernoulli distribution with $\mu_l$, Eq. (30) is computed in $O(L)$ time. As the second term in Eq. (29) requires only $O(LE)$ time, the computation time of the M-step is finally reduced to $O(LE)$ time.

Figure 7: The recovered network properties by the MAG model and the Kronecker graphs model on the Yahoo!-Answers network: (a) In-degree, (b) Out-degree, (c) Singular value, (d) Singular vector, (e) Clustering coefficient, (f) Triad participation. For every network property, MAG model outperforms the Kronecker graphs model.

B Experiments

B.1 Yahoo!-Answers Network


Here we add some experimental results that we omitted in Section 4. First, Figure 7 compares the six network properties of the Yahoo!-Answers network and the synthetic networks generated by the MAG model and the Kronecker graphs model fitted to the real network. The MAG model in general shows better performance than the Kronecker graphs model. In particular, the MAG model greatly outperforms the Kronecker graphs model in local-clustering properties (clustering coefficient and triad participation).

Second, to quantify the recovery of the network properties, we show the KS and L2 statistics for the synthetic networks generated by the MAG model and the Kronecker graphs model in Table 6. Table 6 confirms the visual inspection of Figure 7. The MAG model shows better statistics than the Kronecker graphs model overall and there is a large improvement in the local-clustering properties.


Table 6: KS and L2 for MAG and Kronecker model fitted to Yahoo!-Answers network

KS      InD    OutD   SVal    SVec    TP     CCF    Avg
MAG     3.00   2.80   14.93   13.72   4.84   4.80   7.35
Kron    2.00   5.78   13.56   15.47   7.98   7.05   8.64

L2      InD    OutD   SVal    SVec    TP     CCF    Avg
MAG     0.96   0.74   0.70    6.81    2.76   2.39   2.39
Kron    0.81   2.24   0.69    7.41    6.14   4.73   3.67

Table 7: KS and L2 for logistic regression methods fitted to AddHealth network

KS      InD    OutD   SVal   SVec   TP     CCF    Avg
R7      2.00   2.58   0.58   3.03   5.39   5.91   3.24
F7      1.59   1.59   0.52   3.03   5.43   5.91   3.00

L2      InD    OutD   SVal   SVec   TP     CCF    Avg
R7      0.54   0.58   0.29   1.09   3.43   2.42   1.39
F7      0.42   0.24   0.27   1.12   3.55   2.09   1.28

B.2  AddHealth  Network 

We briefly mentioned the logistic regression method in the AddHealth network experiment. Here we provide the details of the logistic regression and its full experimental results.

For the variables of the logistic regression, we use a set of real attributes in the AddHealth network dataset. For such a set of attributes, we used F7 (forward selection) and R7 (random selection) defined in Section 4. Once the set of attributes is fixed, we come up with a linear model:

$$P(i \to j) = \frac{\exp\left( c + \sum_l \alpha_l F_{il} + \sum_l \beta_l F_{jl} \right)}{1 + \exp\left( c + \sum_l \alpha_l F_{il} + \sum_l \beta_l F_{jl} \right)}.$$
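The linear model above can be evaluated for all node pairs at once. The sketch below is illustrative: the function name is hypothetical, NumPy is an assumed dependency, and the coefficients $c$, $\alpha$, $\beta$ are taken as already fitted (the fitting procedure itself is not shown).

```python
import numpy as np

def logistic_edge_prob(c, alpha, beta, F):
    """P(i -> j) = sigmoid(c + sum_l alpha_l F_il + sum_l beta_l F_jl) for all pairs."""
    F = np.asarray(F, dtype=float)
    source_score = F @ np.asarray(alpha, dtype=float)  # depends on node i only
    target_score = F @ np.asarray(beta, dtype=float)   # depends on node j only
    z = c + source_score[:, None] + target_score[None, :]
    return 1.0 / (1.0 + np.exp(-z))
```

The score is a sum of a source-only term and a target-only term, so the model has no term that depends on whether two nodes share an attribute value, which is the kind of interaction the MAG affinity matrices encode.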


Table 7 shows the KS and L2 statistics for logistic regression methods under R7 and F7 attribute sets. Logistic regression appears to recover the degree distributions. However, it fails to recover the local-clustering properties (clustering coefficient and triad participation) for both sets.


